BUG: Series construction with EA dtype and index but no data fails #33846
Conversation
Maybe a test in tests.extension?
Series construction passes through sanitize_array, so I'm not sure what tests could be added to the extension tests, since we have a code-check rule ("Check for the following code in the extension array base tests") and this PR is about Series construction. Could maybe add more parameterisation to the Series test? Do we have a fixture for internal EA dtypes?
Ah, using tm instead of self. Will add tests to the extension tests and remove this IntegerArray test from pandas/tests/series/test_constructors.py.
The errors seem to indicate that we pass the scalar to the ExtensionArray constructor. But it should maybe be the responsibility of the Series constructor to convert a scalar to something like
That's what I was thinking, yeah.
That would be inconsistent with how non-EA series are constructed. Are we OK with that? IIUC the traceback in the issue OP is no longer relevant on master.
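For reference, a minimal version of the construction from the issue, as it should behave once fixed (a sketch; ``Int64`` here stands in for any EA dtype):

```python
import pandas as pd

# Constructing a Series with an EA dtype and an index but no data.
# With the fix this produces an all-NA Series instead of raising.
s = pd.Series(index=[1, 2, 3], dtype="Int64")
print(s)
```

The result is length 3, all ``<NA>``, with the requested ``Int64`` dtype.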
Not directly related to this PR, but may offer a chance of avoiding
Actually, at the moment this 'fix' is not 100% consistent: in the current 'fix' a 1-d EA is returned from _try_cast, so it's not 100% consistent with non-EA types. So maybe #33846 (comment) may need to be fixed first. L538 could be changed to continue, and a construct_1d_pdarray_preserving_na created to duplicate
pd.array is strictly 1D only, while your numpy example produces a 0-d array.
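A quick illustration of the distinction (a sketch, not part of the PR):

```python
import numpy as np
import pandas as pd

# numpy will happily wrap a bare scalar in a 0-dimensional array...
zero_d = np.array(1)
print(zero_d.ndim)  # 0

# ...whereas pd.array is strictly 1-dimensional and rejects bare scalars.
try:
    pd.array(1)
except ValueError as exc:
    print("rejected:", exc)
```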
_try_cast returns an array. With non-EA dtypes it's a 0-d array that sanitize_array then broadcasts, whereas at the moment this 'fix' returns a 1-d array (we don't want to implement 0-d arrays for EA, #33846 (comment)), and then in sanitize_array we create the correct-length array from the single-element EA array returned from _try_cast. So if this fix is not suitable, I see a few other options.
I think that union return types (1) should be avoided, passing additional keywords (2) violates the single-responsibility principle, and additional special-casing (3) should be avoided. Any other suggestions?
Another constructor case fixed by this branch, for consistency with numpy types.
On master:
So we could add some more test cases.
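One such extra case might be sketched like this (a hypothetical test body, using ``tm`` from ``pandas._testing`` as in the existing suite; ``Int64`` stands in for the parametrised dtype):

```python
import pandas as pd
import pandas._testing as tm

# Scalar data plus an index should broadcast the scalar for an EA dtype,
# matching the behaviour of numpy dtypes.
result = pd.Series(1, index=[1, 2, 3], dtype="Int64")
expected = pd.Series([1, 1, 1], index=[1, 2, 3], dtype="Int64")
tm.assert_series_equal(result, expected)
```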
pandas/core/construction.py (Outdated)
@@ -533,6 +533,10 @@ def _try_cast(
    if isinstance(dtype, ExtensionDtype) and dtype.kind != "M":
        # create an extension array from its dtype
        # DatetimeTZ case needs to go through maybe_cast_to_datetime

        if lib.is_scalar(arr):
We could look into how to identify a collection that could be considered a 'scalar' for some EA, e.g. JSONDtype, although I think that's out of scope for the issue this PR attempts to fix (i.e. IntegerArray, where the scalars are scalars).
Rather than this I would call construct_1d_arraylike_from_scalar, but I wouldn't do it right here; rather on L453, e.g. add an elif is_scalar(data).
That's option 3 in #33846 (comment). Do this just for EA types and keep the code path the same for non-EA types?
No, this will work generically.
I'm getting a few failures in pandas/tests/series/test_constructors.py. I'll push the change anyway and use the CI to see what else fails while I investigate.
kk
@pytest.mark.xfail(reason="GH-26469")
def test_series_constructor_scalar_with_one_element_index(self, data):
    # TypeError: data type not understood
    super().test_series_constructor_scalar_with_one_element_index(data)
There is some special-casing for datetime in the Series construction. Although fixing this could also be considered out of scope for the issue this PR attempts to close, I could look into it further and maybe raise a separate issue if it's not fixed here.
Opening a separate issue is fine for me as well, either way
This is not failing with the changes as they stand at the moment. The previous 'fix' was inside if isinstance(dtype, ExtensionDtype) and dtype.kind != "M" and hence not applicable to DatetimeTZDtype.
@@ -151,6 +151,16 @@ def test_array_from_scalars(self, data):
        # ValueError: PandasArray must be 1-dimensional.
        super().test_array_from_scalars(data)

    @skip_nested
    def test_series_constructor_scalar_with_index(self, data):
        # ValueError: Length of passed values is 1, index implies 3.
For the object dtype, the scalar is a tuple, so this failure is related to #33846 (comment).
doc/source/whatsnew/v1.1.0.rst (Outdated)
@@ -732,7 +732,7 @@ ExtensionArray
^^^^^^^^^^^^^^

- Fixed bug where :meth:`Series.value_counts` would raise on empty input of ``Int64`` dtype (:issue:`33317`)
-
- Fixed bug where :class:`Series` construction with an EA dtype and an index but no data would fail (:issue:`26469`)
or scalar data?
    expected = pd.Series([scalar] * 3, index=[1, 2, 3], dtype=dtype)
    self.assert_series_equal(result, expected)

    def test_series_constructor_scalar_with_one_element_index(self, data):
I would maybe just combine this one with the test above (both are about scalar with index)? (And one less test to override in the subclasses.)
@pytest.mark.xfail(reason="GH-26469", strict=False)
def test_series_constructor_no_data_with_index(self, data, na_value):
    # ValueError: Cannot convert non-finite values (NA or inf) to integer
    super().test_series_constructor_no_data_with_index(data, na_value)
Do you know why sparse is failing?
Hmm, so much for the xfail; it's no longer failing. It was only failing on one of the fill values before, hence the strict=False.
@pytest.mark.xfail(reason="GH-26469")
def test_series_constructor_no_data_with_index(self, data, na_value):
    # pyarrow.lib.ArrowInvalid: only handle 1-dimensional arrays
    super().test_series_constructor_no_data_with_index(data, na_value)
Do you know why it is failing for this dtype?
The traceback is:
_______________________________________________ TestConstructors.test_series_constructor_no_data_with_index ________________________________________________
self = <pandas.tests.extension.arrow.test_bool.TestConstructors object at 0x000001E9918B7B50>
dtype = <pandas.tests.extension.arrow.arrays.ArrowBoolDtype object at 0x000001E9918B79A0>, na_value = None
def test_series_constructor_no_data_with_index(self, dtype, na_value):
> result = pd.Series(index=[1, 2, 3], dtype=dtype)
pandas\tests\extension\base\constructors.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\series.py:283: in __init__
data, index = self._init_dict(data, index, dtype)
pandas\core\series.py:372: in _init_dict
s = create_series_with_explicit_dtype(
pandas\core\construction.py:629: in create_series_with_explicit_dtype
return Series(
pandas\core\series.py:329: in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
pandas\core\construction.py:459: in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
pandas\core\construction.py:542: in _try_cast
subarr = array_type(arr, dtype=dtype, copy=copy)
pandas\tests\extension\arrow\arrays.py:82: in _from_sequence
return cls.from_scalars(scalars)
pandas\tests\extension\arrow\arrays.py:72: in from_scalars
arr = pa.chunked_array([pa.array(np.asarray(values))])
pyarrow\array.pxi:265: in pyarrow.lib.array
???
pyarrow\array.pxi:80: in pyarrow.lib._ndarray_to_array
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowInvalid: only handle 1-dimensional arrays
pyarrow\error.pxi:84: ArrowInvalid
Hmm, it seems to be because, using lib.is_scalar, pa.NULL is passed to sanitize_array:
>>> import pyarrow as pa
>>>
>>> pd._libs.lib.is_scalar(pa.NULL)
False
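In other words, the scalar check only recognises a fixed set of Python/numpy/pandas scalar types, so a third-party sentinel like pa.NULL falls through. A sketch with plain values (no pyarrow required), using the public alias of lib.is_scalar:

```python
import pandas as pd

# pd.api.types.is_scalar is the public alias of lib.is_scalar.
print(pd.api.types.is_scalar(1))     # True
print(pd.api.types.is_scalar(None))  # True
print(pd.api.types.is_scalar([1]))   # False: lists are not scalars
```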
@jorisvandenbossche not sure if you want this fixed here. Will raise a separate issue for this case in the meantime.
Yeah, that's fine, we don't really support this dtype anyway, it's only to test certain things
    # RecursionError: maximum recursion depth exceeded in comparison
    super().test_series_constructor_scalar_na_with_index(dtype, na_value)

@pytest.mark.xfail(reason="GH-26469")
Can you add a more informative message?
Correct, these need to be unwrapped (easiest); e.g. your check I think will work like this:
@@ -150,6 +150,21 @@ def test_from_dtype(self, data):
        # construct from our dtype & string dtype
        pass

    @pytest.mark.xfail(reason="GH-26469")
should these be a new issue?
A few checks to go, but I think we need a discussion on when to allow a collection to be treated as scalar. So yes, will probably raise an issue for this.
kk, and just flip the references to that, otherwise lgtm. ping on green.
pandas/core/construction.py (Outdated)
try:
    data = maybe_cast_to_datetime(data, dtype)
except TypeError:
    pass
Do you know for what case it raises? I would think that maybe_cast_to_datetime is not supposed to raise (or if it raises, it's an error that should bubble up to the user), so that might need a fix there?
Do you know for what case it raises?

It raises for string dtype with a string that satisfies:
>>> pd.core.dtypes.common.is_datetime64_dtype("M")
True
The scalar used in the extension tests is random and is sometimes, say, "M"; therefore a separate test has been added.
As I would think that maybe_cast_to_datetime is not supposed to raise (or if it raises, it's an error that should bubble up to the user), so that might need a fix there?

It raises TypeError: Cannot cast datetime64 to string:
pandas/core/dtypes/cast.py, lines 1387 to 1395 in 1a82659:
elif is_datetime64_dtype(value) and not is_datetime64_dtype(dtype):
    if is_object_dtype(dtype):
        if value.dtype != DT64NS_DTYPE:
            value = value.astype(DT64NS_DTYPE)
        ints = np.asarray(value).view("i8")
        return tslib.ints_to_pydatetime(ints)
    # we have a non-castable dtype that was passed
    raise TypeError(f"Cannot cast datetime64 to {dtype}")
I suspect that maybe_cast_to_datetime shouldn't raise in this case; it seems to contradict the "maybe" in maybe_cast_to_datetime.
I would say that's a bug in maybe_cast_to_datetime. We should not check the actual value to see if it is a datetime64 dtype, but only whether that value (e.g. because it is an array) has such a dtype. (That's related to passing the actual dtype vs the array to the is_..._dtype functions, something we have been discussing lately.)
Can you see if this fixes it for you:
--- a/pandas/core/dtypes/cast.py
+++ b/pandas/core/dtypes/cast.py
@@ -1384,7 +1384,7 @@ def maybe_cast_to_datetime(value, dtype, errors: str = "raise"):
pass
# coerce datetimelike to object
- elif is_datetime64_dtype(value) and not is_datetime64_dtype(dtype):
+ elif is_datetime64_dtype(getattr(value, "dtype", None)) and not is_datetime64_dtype(dtype):
if is_object_dtype(dtype):
if value.dtype != DT64NS_DTYPE:
value = value.astype(DT64NS_DTYPE)
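A quick way to see the effect of the proposed guard (a sketch: a plain string has no .dtype attribute, so the guarded check sees None, for which the helper returns False):

```python
from pandas.api.types import is_datetime64_dtype

value = "M"  # a string numpy's dtype parser reads as generic datetime64

# Unguarded: the string itself is interpreted as a dtype spec.
print(is_datetime64_dtype(value))  # True

# Guarded: strings have no .dtype, so the check sees None -> False.
print(is_datetime64_dtype(getattr(value, "dtype", None)))  # False
```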
Can you see if this fixes it for you:

Thanks.
# GH 33559 - empty index
result = pd.Series(index=[], dtype=dtype)
expected = pd.Series([], index=pd.Index([], dtype="object"), dtype=dtype)
self.assert_series_equal(result, expected)
This PR appears to also close #33559. The index discrepancy is consistent with non-EA types:
>>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+1446.g1c88e6aff'
>>> pd.Series(dtype="int64", index=[]).index
Index([], dtype='object')
>>>
>>> pd.Series(dtype="int64").index
Index([], dtype='object')
>>>
>>> pd.Series([], dtype="int64").index
RangeIndex(start=0, stop=0, step=1)
>>>
Yes, there is another PR trying to clean this up
lgtm. can merge on green.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff